1 Summary of contents
2 Directory organization
The chunk below describes
3 Introduction
The analysis is split into multiple parts.
- A database-dependent mapping and community profiling at the SDP and Phylotype level.
- An assembly-based approach which involes MAG binning per sample and dereplication. Requires manual annotation of input for the next section
- Core genome phylogeny construction of MAGs and isolate genomes.
- All high-quality MAGs
- One MAG per magOTU cluster
- Mapping and SNV calling of reads to
- Just MAGs
- A combination of isolates and MAGs
The samples used in this analysis are mentioned below.
- “M1.1”, “M1.2”, “M1.3”, “M1.4”, “M1.5”,
- Apis mellifera from India
- “DrY2_F1”,“DrY2_F2”,
- Apis mellifera from Switzerland
- “AmAi02”,“AmIu02”,
- Apis mellifera from Japan
- “C1.1”, “C1.2”, “C1.3”, “C1.4”, “C1.5”,
- Apis cerana from India, colony number 1
- “C2.1”, “C2.2”, “C2.3”, “C2.4”, “C2.5”,
- Apis cerana from India, colony number 2
- “C3.1”, “C3.2”, “C3.3”, “C3.4”, “C3.5”,
- Apis cerana from India, colony number 1
- “AcCh05”,“AcKn01”,
- Apis cerana from Japan
- “D1.1”,“D1.2”,“D1.3”,“D1.4”,“D1.5”,
- Apis dorsata from India, colony number 1
- “D2.1”,“D2.2”,“D2.3”,“D2.4”,“D2.5”,
- Apis dorsata from India, colony number 2
- “D3.1”,“D3.2”,“D3.3”,“D3.4”,“D3.5”,
- Apis dorsata from India, colony number 3
- “F1.1”,“F1.2”,“F1.3”,“F1.4”,“F1.5”,
- Apis florea from India, colony number 1
- “F2.1”,“F2.2”,“F2.3”,“F2.4”,“F2.5”,
- Apis florea from India, colony number 2
- “F3.1”,“F3.2”,“F3.3”,“F3.4”,“F3.5”
- Apis florea from India, colony number 3
3.1 Data summary
50 samples were first mapped to a host database and then the unmapped reads to a microbiome database.
The raw reads and trimmed reads were checked for quality using the
tool fastqc.
The report (html format) also summarises basic statistics including
number of reads. The QC results can be found in their respective folders
at ./fastqc/raw/{SAMPLE}_R*_fastqc.html and
./fastqc/trim/{SAMPLE}_R*_trim_fastqc.html.
4 Data description
The dataset comprises samples from 4 species of honey bees sampled in India.
- Apis mellifera (5 individuals from 1 colony)
- Apis cerana (5 individuals from 3 colonies)
- Apis dorsata (5 individuals from 3 colonies)
- Apis florea (5 individuals from 3 colonies)
Some samples from older publications of the lab are also included for Apis mellifera and Apis cerana
The data is stored in various locations as described below and backed up on the NAS.
Raw data:
NAS recerche:
/nas/FAC/FBM/DMF/pengel/spirit/D2c/aprasad/211102_Medgenome_india_samples_resequenced/nas/FAC/FBM/DMF/pengel/spirit/D2c/aprasad/211018_Medgenome_india_samples(cluster - aprasad@curnagl.dcsr.unil.ch)NAS:
/home/aiswarya/mnt/aprasad/SPIRIT_Project/Data/RawData/211018_Medgenome_india_samples.tar.gz/home/aiswarya/mnt/aprasad/SPIRIT_Project/Data/RawData/211102_Medgenome_india_samples_resequenced.tar.gz/home/aiswarya/mnt/lab_resources/NGS_data/20211018_A01223-105-HC32VDSX2//home/aiswarya/mnt/lab_resources/NGS_data/20211102_A01223-105-HC32VDSX2/(workstation - aiswarya@130.223.110.124)
Trimmed data:
/work/FAC/FBM/DMF/pengel/spirit/aprasad/211018_Medgenome_india_samples/01_Trimmed//work/FAC/FBM/DMF/pengel/spirit/aprasad/211102_Medgenome_india_samples_resequenced/01_Trimmed/(cluster - aprasad@curnagl.dcsr.unil.ch)Working directory backup: (need to keep up-to-date using script on the cluster
bash ~/backup_workdir.shand logs are writted to ~/yymmdd_backup_log on the cluster)/home/aiswarya/mnt/aprasad/Backups/working_dir_backup/Cluster/211102_Medgenome_india_samples_resequenced/home/aiswarya/mnt/aprasad/Backups/working_dir_backup/Cluster/211018_Medgenome_india_samples(workstation - aiswarya@130.223.110.124)Results and important intermediate files:
/home/aiswarya/mnt/aprasad/SPIRIT_Project/Data/211018_Medgenome_india_analysis(workstation - aiswarya@130.223.110.124)Conda installation (cluster):
/work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3(cluster - aprasad@curnagl.dcsr.unil.ch)
4.0.1 Nomenclature
There are 56 samples at the moment.
- M1.1 - M1.5 are 5 individuals of Apis mellifera from colony 1
- Cx.1 - Cx.5 are 5 individuals of Apis cerana from colony x for 3 colonies (1 - 3)
- Dx.1 - Dx.5 are 5 individuals of Apis dorsata from colony x for 3 colonies (1 - 3)
- Fx.1 - Fx.5 are 5 individuals of Apis florea from colony x for 3 colonies (1 - 3)
- DrY2_F1 and DrY2_F2 are samples from KE’s 2015 paper. Apis mellifera from switzerland (Les Droites)
- AcCh05, AcKn01 and AmAi02, AmIu02 are two samples of Apis cerana and Apis mellifera from different apiaries in Japan
These samples were from earlier runs:
- 20151119_WINDU89 20151119 Kirsten_Ellegaard 6 GTF
Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee
gut microbiota (2019, NatCom) Nurses, Year 1, Les Droites
20160415_OBIWAN225 20160415 Kirsten_Ellegaard 12 GTF Illumina 100 PE
HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota
(2019, NatCom) Foragers/Winterbees, Year 1, Les Droites
- 20161216_OBIWAN275 20161216 Kirsten_Ellegaard 6 GTF
Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee
gut microbiota (2019, NatCom) Nurses, Year 2, Les Droites
- 20170310_WINDU179 20170310 Kirsten_Ellegaard 12 GTF
Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee
gut microbiota (2019, NatCom) Foragers/Winterbees, Year 2, Les Droites
(INCLUDED FOR NOW)
- 20170426_OBIWAN300 20170426 Kirsten_Ellegaard 6 GTF
Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee
gut microbiota (2019, NatCom) Nurses, Year 2, Grammont
- 20170428_WINDU191 20170428 Kirsten_Ellegaard 12 GTF
Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee
gut microbiota (2019, NatCom) Foragers/Winterbees, Year 2,
Grammont
- 20180118_OBIWAN338-339 20180118 Kirsten_Ellegaard
30 GTF Illumina 100 PE HiSeq 2500 Metagenomes of individual honey bees,
subjected to dietary manipulation and kept in the lab
- 20180612_KE_japan_metagenomes 20180612 Ryo_Miyasaki 40 Japan Illumina 100 PE HiSeq 2500 Vast differences in strain-level diversity in two closely related species of honey bees (2020, CurBiol) Sampling and sequencing done in Japan (INCLUDED FOR NOW)
4.1 Databases
4.1.1 Host database
The database is named 4_host_db.
A paper published in Dec. 2019 a high quality Apis dorsata genome as an improvement over a previous submission in 2013. The paper also mentioned studies that had previously sequenced the Apis florea genome in 2012, Apis cerana genome in 2015 (other assemblies submitted found here) and Apis mellifera genome in 2018 (other assemblies submitted listed here). So far I have not found any whole genome assemblies of Apis adreniformis.
These assemblies were downloaded and concatenated to make the 4_host_db. It contains,
>apis_mellifera_2018 PRJNA471592 version Amel_Hac3.1>Apis_cerana PRJNA235974>Apis_cerana_mitochondrion PRJNA235974>Apis_florea PRJNA45871>Apis_dorsata PRJNA174631
4.1.2 Microbiome database
The database is named genome_db_210402.
This database was set up by Dr Kirsten Ellegard (KE) and is described on zenodo. It uses NCBI and IMG genome assemblies. It is non-redundant and contains concatenated genomes. Located at in lab NAS directory at lab_resources/Genome_databse. In this pipeline so far, the version of the pipeline set up by KE’s community profiling pipeline.
It was downloaded by the script download.py --genome_db
from zenodo. This
dowloads multiple directories. The Orthofider directory can be deleted
as this is generated for the pipeline as needed. The bed files can be
generated from gff files if desired but this was already done for the
genomes of that database so was not repeated. The other files (ffn, gff)
are found in the public repository from where the genome was downloaded.
The faa files were reorganised in directories corresponding to their
respective SDPs in order to allow the Orthofinder scripts to assign
orthogroups per SDP.
These repositories follow their own annotation pipeline to generate
these files. The database can also be found at
<NAS>/lab_resources/Genome_database/database_construction.
It contains 198 genomes identified by their locus tags and described in
<NAS>/lab_resources/Genome_database/database_construction/database_construction
in metadata sheets.
5 Description of pipeline methods
Run entire snakemake pipeline using:
snakemake -p --use-conda --conda-prefix /work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3/spirit_envs --conda-frontend conda --profile slurm --restart-times 0 --keep-going
and if resuming a failed or stopped run, use:
snakemake -p --use-conda --conda-prefix /work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3/spirit_envs --conda-frontend conda --profile slurm --restart-times 0 --keep-going --rerun-incomplete
conda environments are all specified in envs/ and built
by snakemake under various names in
/work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3/spirit_envs
Run the pipeline in the conda environment called
snakmake_with_samtools in the cluster. It is a clone of the
snakemake environment made as recommended by Snakemake docs
followed by conda install biopython and later
conda install samtools in it. This is so that Kirsten’s
core_cov script works (specific conda environments can only be specified
for rules using bash).
5.1 Description of directory structure
Directory names are largely self-explanatory.
./00_rawdata,./01_Trimmed,./02_HostMapping,./03_MicrobiomeMappingdatabasecontains databases to be used for mapping. It also contains./Orthofinderfiles. These are described later in the sections describing associated rules../envscontains all yaml files required for this pipeline. They contain a list of packages needed to specify the conda environment for various rules to work within../logscontains log files./scriptscontains all scripts needed for the snakemake pipeline. Many of these scripts are adapted from Kirsten’s scripts from the zenodo directories, github or from the lab_resources directories. The results of the core coverage estimation are stored in,./04_CoreCov_211018_Medgenome_india_samples./07_SNVProfilingis not fully implemented (yet) for these samples as it is not relevant at this time../fastqccontains fastqc results for trimmed and raw files + bamfile_list_red.txt - required by KE’s core coverage pipeline + bamfile_list.txt - required by KE’s core coverage pipeline + Adapters-PE.fa - is generated based on index sequences by the script./scripts/write_adapters.py(was deleted earlier as it was on scratch. Needs to be re-written.) + config.yaml - comprises information including list of samples + index_table.csv - used by the script./scripts/write_adapters.pyto make indexed adapters + Mapping_summary.csv - result from the rule summarize_mapping + rulegraph.pdf - summary DAG of rules in the pipeline (made usingsnakemake --forceall --rulegraph | dot -Tpdf > Figuers/rulegraph.pdf) + Report.Rmd - this report ! + Report.html - this report compiled ! + Snakefile - the pipipeline !!!
5.2 Rules
DAG of all rules in the pipeline
rule raw_qc- This rule runs fastqc
on raw files and saves the output in
./fastqc/raw.
- This rule runs fastqc
on raw files and saves the output in
rule make_adapters- This rule uses the script
_./scripts/write_adapters.py_ which was deleted earlier. - It uses the index_table.csv files to make the Adapters-PE.fa file containing indexed adapters.
- This rule uses the script
rule trim- This rules trims reads using trimmomatic.
- The Adapters-PE.fa files is used.
- The trimming parameters are.
rule trim_qc- This rule runs fastqc
on trimmed files and saves the output in
./fastqc/trim.
- This rule runs fastqc
on trimmed files and saves the output in
rule index_bwarule index_samtools- Indexes genomes in
./databasefor use by samtools.
- Indexes genomes in
rule make_genome_list- Creats a text file corresponding to each set of genomes in
./databaseto be used when we need to know which genomes are present in given genome database.
- Creats a text file corresponding to each set of genomes in
rule host_mapping- Uses bwa
mem to map reads for each sample to a database containing host
genomes,
./database/4_host_db. - Unmapped reads identified by samtools with the option
-f4are stored in a seperate bam file. - The bam file with all alignments is used later by the counting rule and then deleted after counting.
- Uses bwa
mem to map reads for each sample to a database containing host
genomes,
rule host_mapping_extract_reads- Reads that did not map to the host database are extracted and then mapped to the microbiome database.
- They are extracted using picard.
- The option
-Xmx8gensures that java is given 8 GB memory. If suffecient memory is not allocated, the job will fail.
rule host_mapping_count- Counts the number of mapped, properly and unmapped reads from host mapping.
- It uses the following flags to identify each kind of read:
- count number of properly mapped reads:
-f 67 -F 2304- 67 (include -f) flags
- read paired (0x1)
- read mapped in proper pair (0x2)
- first in pair (0x40)
- 2308 (exclude -F) flags
- read unmapped (0x4)
- supplementary alignment (0x800)
- not primary (0x100)
- 67 (include -f) flags
- count number of mapped reads:
-f 67 -F 2304- 65 (include -f) flags
- read paired (0x1)
- first in pair (0x40)
- 2308 (exclude -F) flags
- read unmapped (0x4)
- supplementary alignment (0x800)
- not primary (0x100)
- 65 (include -f) flags
- count number of unmapped reads:
-f 67 -F 2304- 69 (include -f) flags
- read paired (0x1)
- read unmapped (0x4)
- first in pair (0x40)
- 2304 (exclude -F) flags
- supplementary alignment (0x800)
- not primary (0x100)
- 69 (include -f) flags
- count number of properly mapped reads:
- After counting, the bam file is deleted and an empty file is touched to mark that counting is complete for said file.
rule microbiomedb_mapping- The host unmapped reads extracted earlier are mapped to the microbiome database.
- Mapped reads are extracted using a perls script as follows. First,
unmapped reads are excluded using
-F4and then supplementary reads are excluded-F 0x800. Finally, the remaining reads are sent through./scripts/filter_sam_aln_length.pl. The script filters away reads that have less than 50bps matching in the alignment.
rule microbiomedb_extract_reads- Extracts mapped reads identified as mentioned in the previous rule and saves them as fastq files.
rule microbiome_mapping_count- Counts reads as explained in the other counting rule, host_mapping_count.
rule cat_and_clean_counts- Compiles all the counts into 1 file for easier parsing by the summarize rule.
rule summarize_mapping- Summarizes counts in a csv file using the results of earlier rules and by counting fastq files.
rule run_orthofinder- Runs Orthofinder for each phylotype.
- Before running this, group genomes by phylotype in
directories for Orthofinder to be able to get which groups to consider
together. When the genomes for the database are downloaded at
./database/faa_files/{genome}, they are all in one directory. Grouping was done using./scripts/rearange_faa.py. As written, it is to be run from the scripts directory in which it resides (!! it uses relative paths !!). - faa files for each genome comes from the respective databese (NCBI for example)
- When orthofinder finishes, the following file will be generated and
used for the following steps,
./database/faa_files/{phylotype}/OrthoFinder/Results_dir/Orthogroups/Orthogroups.txt. - The file Orthogroups.txt contains a list of orthogroups.
Eg, each line would look like
- OG0000003: C4S76_01365 C4S76_01370 C4S76_01375 C4S77_06100 C4S77_06130 C4S77_06135 C4S77_06775 C4S77_06780 C4S77_06785 C4S77_06790 C4S77_06795 C4S77_06800 C4S77_06805 C4S77_06810 C4S77_09595 C4S77_09600 C4S77_09605 C4S77_09610 C4S77_09615 C4S77_09620 C4S77_10540 Ga0307799_111506
- where, OG0000003 is an orthogroup for this group of genomes (phylotype) and C4S77, Ga0307799 etc. are genomes that belong to that group. 09615, 09620, 10540 are genes from C4S77 and 111506 from Ga0307799 that belong to orthogroup OG0000003.
rule get_single_ortho- The files Orthogroups.txt is parsed by
./scripts/get_single_ortho.pyand single-copy orthologs are written to./database/Orthofinder/{phylotype}_single_ortho.txt - The script reads each orthogroup and counts the number of genomes present in genes of that orthogroup. If the number of genes in the orthogroup and the number of genomes in the orthogroup are the same as the total number of genomes in the database for said phylotype, the genes in the group are considered single-copy core genes and included for core coverage estimation.
- The files Orthogroups.txt is parsed by
rule extract_orthologs- This rule prepares files with sequences of orthologs in order to calculate percentage identity (perc_id).
- First, it reads the file
./database/Orthofinder/{phylotype}_single_ortho.txtand gets all the genome-ids present in the ortholog-file, and all the gene-ids associated with each gene-family. Using this list it extracts and stores the sequences of each of the genes of an orthogroup in an faa file and ffn file corresponding to each group in the directory./database/Orthofinder/{phylotype}_ortho_sequences/. EXAMPLE: cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034.faa
>Ga0061073_1479
MTKYQTLIFVPEGSLLNEKTAEQVALRQTLKELGHDFGPAERLKYSSLQGQVKMMGFSER
IALTLQNFCTDDLAEAEKIFKTKLGGQRQLVKDAIPFLDQITNQVKLILLAKEERELISA
RLSDSELLNYFSASYFKEDFADPLPNKNVLFQIIKEQELDPDNCLVIGTDLVEEIQGAEN
AGLQSLWIAPKKVKMPISPRPTLHLTKLNDLLFYLELN
>Ga0070887_12184
MKGKVHLAKYETLIFILEGSLLNEKVAEQNALRQTLKLTGREYGPAERIQYNSLQEKIKL
LGFDERIKLTLQEFFKNDWISAKGTFYNQLQKQDQLNKDVVPFLDEVKNKVNLVLLTKEK
KDVASYRMQNTELINYFSAVYFKDDFACKFPNKKVLITILQQQNLTPATCLVIGTNLVDE
IQGAENANLDSLWLAPKKVKMPISPRPTLHLNKLTDLLFYLELS
>Ga0072400_11133
LAKFQTLIFILEGSLLDEKIAEQSALKQTLKSTGRDFGPSERLKYNSVRENNKLLGFEDR
IQLILQTFFHENWQDAGQIFIKELQKQNRLNKEVLPFLNKVNCKVKLILLAKENKKVALQ
RMKNTELVNYFPFAYFKDDFTEKLPHKKVLTTILQKQNLAFATSLVIGTDLADEIQAAEN
AKIQSLWLAPKKVKMPISPHPTLHLNKLNDLLFYLELScat ./database/Orthofinder/firm5_ortho_sequences/OG0001034.ffa
>Ga0061073_1479
GTGACTAAATATCAAACGTTAATTTTTGTTCCTGAAGGTAGTTTATTAAATGAAAAAACG
GCTGAACAAGTCGCACTCAGGCAAACTTTAAAAGAACTCGGACATGATTTTGGACCAGCT
GAACGCCTAAAATATTCTAGCTTACAAGGACAAGTTAAAATGATGGGTTTCAGCGAGCGC
ATTGCACTAACCCTGCAAAATTTTTGTACCGACGATTTGGCTGAGGCCGAAAAAATTTTC
AAAACAAAATTAGGAGGTCAGCGACAACTAGTCAAAGATGCTATTCCATTTCTTGACCAA
ATAACAAACCAAGTTAAGCTAATTCTCCTTGCCAAAGAAGAACGTGAACTAATCTCAGCT
CGCCTATCTGATAGCGAACTACTTAACTATTTTTCTGCTTCCTATTTTAAAGAAGATTTT
GCTGATCCTTTGCCAAATAAAAATGTCCTGTTTCAAATTATAAAAGAGCAAGAATTAGAT
CCAGATAATTGCCTAGTTATCGGCACAGATTTAGTTGAAGAAATTCAAGGAGCAGAAAAC
GCTGGCTTGCAATCATTATGGATTGCACCAAAAAAGGTTAAAATGCCAATTAGTCCTCGA
CCTACTCTGCATTTAACTAAACTCAATGACTTGCTTTTTTATCTTGAATTAAACTAG
>Ga0070887_12184
ATGAAAGGAAAAGTACACTTGGCAAAATATGAAACTTTAATTTTTATTCTTGAAGGAAGC
TTATTAAACGAAAAAGTTGCAGAACAAAATGCACTTAGGCAAACTTTGAAATTAACTGGC
AGAGAATATGGTCCAGCTGAGCGCATACAATATAATTCATTACAAGAAAAGATTAAATTA
CTAGGATTTGATGAGCGCATTAAATTAACTTTGCAGGAATTCTTTAAAAATGACTGGATT
TCTGCGAAAGGCACTTTTTATAACCAGTTGCAAAAACAAGATCAGTTAAATAAAGATGTA
GTGCCCTTTTTAGATGAGGTGAAAAACAAAGTTAACTTGGTTTTGCTGACGAAAGAGAAA
AAAGATGTGGCTTCATACCGCATGCAAAATACAGAGCTAATAAATTATTTTTCCGCAGTT
TATTTTAAAGACGATTTTGCATGTAAGTTTCCAAATAAGAAGGTTTTGATAACAATATTG
CAGCAGCAGAATCTGACGCCAGCCACTTGTCTTGTAATTGGGACAAACTTAGTCGATGAA
ATTCAGGGTGCCGAAAATGCTAACCTGGATTCTCTTTGGCTAGCGCCCAAGAAAGTAAAA
ATGCCAATTAGTCCACGTCCAACTTTACATTTAAATAAATTAACTGATTTATTATTTTAC
CTAGAATTAAGCTAG
>Ga0072400_11133
TTGGCAAAATTTCAAACATTAATTTTTATTCTTGAGGGCAGTTTATTAGATGAAAAGATT
GCTGAACAAAGTGCATTAAAGCAAACTTTAAAGTCAACTGGCAGAGATTTTGGTCCCAGT
GAACGTTTAAAATATAATTCTGTACGAGAAAATAATAAGTTGCTTGGCTTTGAAGACCGC
ATACAATTAATTTTACAAACATTTTTTCATGAAAATTGGCAAGATGCAGGGCAGATTTTT
ATCAAAGAATTACAAAAGCAAAATCGCTTGAATAAAGAAGTATTGCCATTTTTAAACAAA
GTTAACTGCAAGGTTAAACTAATTCTGCTGGCAAAAGAGAACAAAAAAGTAGCATTACAG
CGCATGAAGAACACAGAGTTGGTAAATTATTTTCCGTTTGCTTATTTTAAAGATGACTTT
ACGGAAAAATTGCCACATAAAAAAGTTTTGACCACCATTTTGCAGAAACAAAACTTGGCG
TTCGCAACTAGTTTAGTAATCGGAACTGACTTAGCAGATGAAATTCAGGCTGCAGAGAAT
GCCAAAATACAGTCACTCTGGCTAGCGCCTAAGAAAGTAAAAATGCCGATTAGCCCGCAC
CCAACTTTACATTTAAATAAATTAAACGATTTATTATTTTACCTAGAATTAAGCTAG
rule calc_perc_id- this rule relies on various tools and scripts tied together by
./scripts/aln_calc.sh. The scripts are:- mafft
for alignment
- The aligned result is in a multi-fasta file called {OrthogroupID}_aln.fasta. EXAMPLE:
cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034_aln.fasta
>Ga0061073_1479
——-MTKYQTLIFVPEGSLLNEKTAEQVALRQTLKELGHDFGPAERLKYSSLQGQVK
MMGFSERIALTLQNFCTDDLAEAEKIFKTKLGGQRQLVKDAIPFLDQIT—NQVKLILL
AKEERELISARLSDSELLNYFSASYFKEDFADPLPNKNVLFQIIKEQELDPDNCLVIGTD
LVEEIQGAENAGLQSLWIAPKKVKMPISPRPTLHLTKLNDLLFYLELN
>Ga0070887_12184
M-KGKVHLAKYETLIFILEGSLLNEKVAEQNALRQTLKLTGREYGPAERIQYNSLQEKIK
LLGFDERIKLTLQEFFKNDWISAKGTFYNQLQKQDQLNKDVVPFLDEVK—NKVNLVLL
TKEKKDVASYRMQNTELINYFSAVYFKDDFACKFPNKKVLITILQQQNLTPATCLVIGTN
LVDEIQGAENANLDSLWLAPKKVKMPISPRPTLHLNKLTDLLFYLELS
>Ga0072400_11133
——-LAKFQTLIFILEGSLLDEKIAEQSALKQTLKSTGRDFGPSERLKYNSVRENNK
LLGFEDRIQLILQTFFHENWQDAGQIFIKELQKQNRLNKEVLPFLNKVN—CKVKLILL
AKENKKVALQRMKNTELVNYFPFAYFKDDFTEKLPHKKVLTTILQKQNLAFATSLVIGTD
LADEIQAAENAKIQSLWLAPKKVKMPISPHPTLHLNKLNDLLFYLELS
- aln_aa_to_dna.py
- This scripts converts the alignments into nucleotide sequences rather than amino acid sequences EXAMPLE:
cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034_aln.fasta>Ga0061073_1479
———————GTGACTAAATATCAAACGTTAATTTTTGTTCCTGAAGGT
AGTTTATTAAATGAAAAAACGGCTGAACAAGTCGCACTCAGGCAAACTTTAAAAGAACTC
GGACATGATTTTGGACCAGCTGAACGCCTAAAATATTCTAGCTTACAAGGACAAGTTAAA
ATGATGGGTTTCAGCGAGCGCATTGCACTAACCCTGCAAAATTTTTGTACCGACGATTTG
GCTGAGGCCGAAAAAATTTTCAAAACAAAATTAGGAGGTCAGCGACAACTAGTCAAAGAT
GCTATTCCATTTCTTGACCAAATAACA———AACCAAGTTAAGCTAATTCTCCTT
GCCAAAGAAGAACGTGAACTAATCTCAGCTCGCCTATCTGATAGCGAACTACTTAACTAT
TTTTCTGCTTCCTATTTTAAAGAAGATTTTGCTGATCCTTTGCCAAATAAAAATGTCCTG
TTTCAAATTATAAAAGAGCAAGAATTAGATCCAGATAATTGCCTAGTTATCGGCACAGAT
TTAGTTGAAGAAATTCAAGGAGCAGAAAACGCTGGCTTGCAATCATTATGGATTGCACCA
AAAAAGGTTAAAATGCCAATTAGTCCTCGACCTACTCTGCATTTAACTAAACTCAATGAC
TTGCTTTTTTATCTTGAATTAAAC
>Ga0070887_12184
ATG—AAAGGAAAAGTACACTTGGCAAAATATGAAACTTTAATTTTTATTCTTGAAGGA
AGCTTATTAAACGAAAAAGTTGCAGAACAAAATGCACTTAGGCAAACTTTGAAATTAACT
GGCAGAGAATATGGTCCAGCTGAGCGCATACAATATAATTCATTACAAGAAAAGATTAAA
TTACTAGGATTTGATGAGCGCATTAAATTAACTTTGCAGGAATTCTTTAAAAATGACTGG
ATTTCTGCGAAAGGCACTTTTTATAACCAGTTGCAAAAACAAGATCAGTTAAATAAAGAT
GTAGTGCCCTTTTTAGATGAGGTGAAA———AACAAAGTTAACTTGGTTTTGCTG
ACGAAAGAGAAAAAAGATGTGGCTTCATACCGCATGCAAAATACAGAGCTAATAAATTAT
TTTTCCGCAGTTTATTTTAAAGACGATTTTGCATGTAAGTTTCCAAATAAGAAGGTTTTG
ATAACAATATTGCAGCAGCAGAATCTGACGCCAGCCACTTGTCTTGTAATTGGGACAAAC
TTAGTCGATGAAATTCAGGGTGCCGAAAATGCTAACCTGGATTCTCTTTGGCTAGCGCCC
AAGAAAGTAAAAATGCCAATTAGTCCACGTCCAACTTTACATTTAAATAAATTAACTGAT
TTATTATTTTACCTAGAATTAAGC
>Ga0072400_11133
———————TTGGCAAAATTTCAAACATTAATTTTTATTCTTGAGGGC
AGTTTATTAGATGAAAAGATTGCTGAACAAAGTGCATTAAAGCAAACTTTAAAGTCAACT
GGCAGAGATTTTGGTCCCAGTGAACGTTTAAAATATAATTCTGTACGAGAAAATAATAAG
TTGCTTGGCTTTGAAGACCGCATACAATTAATTTTACAAACATTTTTTCATGAAAATTGG
CAAGATGCAGGGCAGATTTTTATCAAAGAATTACAAAAGCAAAATCGCTTGAATAAAGAA
GTATTGCCATTTTTAAACAAAGTTAAC———TGCAAGGTTAAACTAATTCTGCTG
GCAAAAGAGAACAAAAAAGTAGCATTACAGCGCATGAAGAACACAGAGTTGGTAAATTAT
TTTCCGTTTGCTTATTTTAAAGATGACTTTACGGAAAAATTGCCACATAAAAAAGTTTTG
ACCACCATTTTGCAGAAACAAAACTTGGCGTTCGCAACTAGTTTAGTAATCGGAACTGAC
TTAGCAGATGAAATTCAGGCTGCAGAGAATGCCAAAATACAGTCACTCTGGCTAGCGCCT
AAGAAAGTAAAAATGCCGATTAGCCCGCACCCAACTTTACATTTAAATAAATTAAACGAT
TTATTATTTTACCTAGAATTAAGC
- trim_aln.py and
sedto simplify headers to contain just genome ID and leave out gene identifier (as they are all single copy core genes).- This script trims out all the sections that do not align by counting which positions have “-” and removing those from all the members of the orthogroup.
cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034_aln_trim.fasta
>Ga0061073
GTGACTAAATATCAAACGTTAATTTTTGTTCCTGAAGGTAGTTTATTAAATGAAAAAACG
GCTGAACAAGTCGCACTCAGGCAAACTTTAAAAGAACTCGGACATGATTTTGGACCAGCT
GAACGCCTAAAATATTCTAGCTTACAAGGACAAGTTAAAATGATGGGTTTCAGCGAGCGC
ATTGCACTAACCCTGCAAAATTTTTGTACCGACGATTTGGCTGAGGCCGAAAAAATTTTC
AAAACAAAATTAGGAGGTCAGCGACAACTAGTCAAAGATGCTATTCCATTTCTTGACCAA
ATAACAAACCAAGTTAAGCTAATTCTCCTTGCCAAAGAAGAACGTGAACTAATCTCAGCT
CGCCTATCTGATAGCGAACTACTTAACTATTTTTCTGCTTCCTATTTTAAAGAAGATTTT
GCTGATCCTTTGCCAAATAAAAATGTCCTGTTTCAAATTATAAAAGAGCAAGAATTAGAT
CCAGATAATTGCCTAGTTATCGGCACAGATTTAGTTGAAGAAATTCAAGGAGCAGAAAAC
GCTGGCTTGCAATCATTATGGATTGCACCAAAAAAGGTTAAAATGCCAATTAGTCCTCGA
CCTACTCTGCATTTAACTAAACTCAATGACTTGCTTTTTTATCTTGAATTAAAC
>Ga0070887
TTGGCAAAATATGAAACTTTAATTTTTATTCTTGAAGGAAGCTTATTAAACGAAAAAGTT
GCAGAACAAAATGCACTTAGGCAAACTTTGAAATTAACTGGCAGAGAATATGGTCCAGCT
GAGCGCATACAATATAATTCATTACAAGAAAAGATTAAATTACTAGGATTTGATGAGCGC
ATTAAATTAACTTTGCAGGAATTCTTTAAAAATGACTGGATTTCTGCGAAAGGCACTTTT
TATAACCAGTTGCAAAAACAAGATCAGTTAAATAAAGATGTAGTGCCCTTTTTAGATGAG
GTGAAAAACAAAGTTAACTTGGTTTTGCTGACGAAAGAGAAAAAAGATGTGGCTTCATAC
CGCATGCAAAATACAGAGCTAATAAATTATTTTTCCGCAGTTTATTTTAAAGACGATTTT
GCATGTAAGTTTCCAAATAAGAAGGTTTTGATAACAATATTGCAGCAGCAGAATCTGACG
CCAGCCACTTGTCTTGTAATTGGGACAAACTTAGTCGATGAAATTCAGGGTGCCGAAAAT
GCTAACCTGGATTCTCTTTGGCTAGCGCCCAAGAAAGTAAAAATGCCAATTAGTCCACGT
CCAACTTTACATTTAAATAAATTAACTGATTTATTATTTTACCTAGAATTAAGC
>Ga0072400
TTGGCAAAATTTCAAACATTAATTTTTATTCTTGAGGGCAGTTTATTAGATGAAAAGATT
GCTGAACAAAGTGCATTAAAGCAAACTTTAAAGTCAACTGGCAGAGATTTTGGTCCCAGT
GAACGTTTAAAATATAATTCTGTACGAGAAAATAATAAGTTGCTTGGCTTTGAAGACCGC
ATACAATTAATTTTACAAACATTTTTTCATGAAAATTGGCAAGATGCAGGGCAGATTTTT
ATCAAAGAATTACAAAAGCAAAATCGCTTGAATAAAGAAGTATTGCCATTTTTAAACAAA
GTTAACTGCAAGGTTAAACTAATTCTGCTGGCAAAAGAGAACAAAAAAGTAGCATTACAG
CGCATGAAGAACACAGAGTTGGTAAATTATTTTCCGTTTGCTTATTTTAAAGATGACTTT
ACGGAAAAATTGCCACATAAAAAAGTTTTGACCACCATTTTGCAGAAACAAAACTTGGCG
TTCGCAACTAGTTTAGTAATCGGAACTGACTTAGCAGATGAAATTCAGGCTGCAGAGAAT
GCCAAAATACAGTCACTCTGGCTAGCGCCTAAGAAAGTAAAAATGCCGATTAGCCCGCAC
CCAACTTTACATTTAAATAAATTAAACGATTTATTATTTTACCTAGAATTAAGC
- calc_perc_id_orthologs.py
- Uses as input, trimmed aligned sequences and a metafile
(
./database/genome_db_210402_metafile.txt) which is a tab-delim file with genome-id in tab1 and SDP-affiliation in tab 3 - First, it checks the number of SDPs contained within the alignment. If more than one, it continues by calculating alignment percentage identity stats across SDPs. If only one SDP, exits script.
- Next, it Compares the genomes in each SDP to all other genomes in
the alignment: calculates percentage identity for all pairwise
combinations. Calculates the max, min, and mean values, prints to file
./database/Orthofinder/{phylotype}_perc_id.txtshowing one orthogroup per line. EXAMPLE: cat ./database/Orthofinder/firm5_perc_id.txt
…
OG0001034 0.674 0.586 0.972
…
- Uses as input, trimmed aligned sequences and a metafile
(
- mafft
for alignment
- this rule relies on various tools and scripts tied together by
rule filter_orthogroups- orthogroups are filtered based on:
- Minimum gene-length 300bp (applied to all members of each gene-family)
- Inter-SDP max alignment identity 95% (only if the phylotype contain multiple SDPs)
- Short genes are filtered off, because they are likley to be less reliable for accurate coverage estimates. Similarly, the inter-SDP similarity threshold is used to ensure that there is enough divergence between the SDPs for reliable mapping (at least as estimated from the currently availabe genomes). It is worthwhile to check the number of gene-families before/after filtering. If a lot of gene-families were filtered off, this could be an indication that the SDPs are not properly discrete.
- This finally results in the single-copy core genes that have been
filtered to be used for core coverage estimation.
./database/Orthofinder/{phylotype}_single_ortho_filt.txt.
- orthogroups are filtered based on:
rule core_cov- takes as input bam files with the alignments for each sample to be
considered (as a text file containing a list of these files) and the
_single_ortho_filt file. Outputs are written to
./04_CoreCov_"+ProjectIdentifier+"/{phylotype}_corecov.txt. - The script reads the filtered orthofile
./database/Orthofinder/{phylotype}_single_ortho_filt.txtand gets the gene-famililes and genome-ids for each SDP. - Then, from bed files, it finds the start and end positions of each
of the genes in an orthogroup for each of the genomes of the orthogroup.
It writes these to the file,
./04_CoreCov_*/{phylotype}_corecov.txteach SDP in the phylotype, start position for each gene family in the genome marked reference for that SDP. - The coverage is also written to this file. Example:
- SDP Sample OG Ref_pos Coverage
firm5_1 DrY1_N1_microbiome_mapped OG0000932 448 18.81
firm5_1 DrY1_N1_microbiome_mapped OG0000931 1991 23.34
…
firm5_2 M1.5_microbiome_mapped OG0000935 1852405 12.29
firm5_2 M1.5_microbiome_mapped OG0000934 1853270 9.95
firm5_3 DrY1_N1_microbiome_mapped OG0000932 1 501.61
firm5_3 DrY1_N1_microbiome_mapped OG0000931 1542 534.77
…
firm5_bombus M1.5_microbiome_mapped OG0000936 1674767 0.0
firm5_bombus M1.5_microbiome_mapped OG0000935 1676256 0.0
firm5_bombus M1.5_microbiome_mapped OG0000934 1677124 0.0
- SDP Sample OG Ref_pos Coverage
- SDP abundance is estimated based on mapped read coverage of core genes. It sums up gene coverages of all the genes og OG families associated with said SDP across genomes belogining to the SDP.
- It also reports PTR (Peak-Trough Ratio).
- Most species in the database are represented by multiple genomes (< 98.5% gANI between genomes). Core genes are inferred at the phylotype. More accurate estimates can be obtained by using a large number (+700) of core genes.
- takes as input bam files with the alignments for each sample to be
considered (as a text file containing a list of these files) and the
_single_ortho_filt file. Outputs are written to
rule core_cov_plots- This R-script will estimate the coverage at the terminus, using the summed core gene family coverages. If the cov-ter cannot be properly estimated (fx. due to draft genome status or lack of replication), an estimate will be generated using the median coverage across core gene families, and the PTR is set to NA. If more than 20% of the core gene families have no coverage, the abundance will be set to zero. As output, a tabular file is generated (including the cov-ter/median cov, and PTR), and a pdf-file with plots for visual validation.
- First, filter for samples with coverage of at least 1 on > 80% of the core genes. Next, values that are deviating no more than 2 times the median are kept others are discarded as outliers.
- Next, gets fitted coordinates and append values to coord-table. It does this by using the segmented package. As explained below,
where, lin.mod is a simple linear model that was made by
base R. The R package Segmented supports breakpoint
analysis. The methods used by this package are applicable when segments
are (nearly continuous) so this means that for the regression to make
sense the core gene families selected should cover the reference genome
well and without too many huge gaps. psi, is a starting
value of the breakpoint. Example of a model fit using segemented,
x is the slope of the first segment and
U1.x is the difference in slopes between the first and
second segment. psi_est is the newly estimated breakpoint.
This along with the slopes - The summary function shows:
Finally, from the segmented model the ptr is calculated as follows:
For cor_ter is the coverage \(y = ax + b\) where x is psi (the breakpoint
on the x axis) and the cor_ori2 is the coverage at the ori
which is the section with maximum coverage and at the tail.
The condition
(psi<psi_min) || (psi>psi_max) || (min_ori_cov<cov_ter)
checks: First; If the break-point is too far from the
expected place (+/-50% of break-point estimate), ptr is set to
NA. Second; If the coverage at ori (either
beginning or end of dataframe) is lower than the estimated coverage at
ter, ptr is also set to NA. Finally, if the coverage of the
origin is not greater than the terminus, ptr is set to
NA.
the PTR was set to
NA, the median will be plotted and used for quantification. Else, the segmented regression line is plotted, and the terminus coverage is used for quantification.rule assemble_host_unmapped
- Takes as input the R1 and R2 reads that were not mapped to the host
and assembles them using spades with the
--metatag and default parameters. - Memory allocation is not obvious. More documentation on this soon.
- Takes as input the R1 and R2 reads that were not mapped to the host
and assembles them using spades with the
rule map_to_assembly
- Map reads that were assembled against the contigs that they were assembled into using bwa mem.
rule cat_and_clean_counts_assembly
- Compiles counts into one file for summarizing
rule summarize_mapping_assembly
- similar to earlier rule “summarize_mapping”
rule backmapping
- NxN mapping for
rule merge_depths
rule binning
rule process_metabat2
rule checkm_evaluation
rule prepare_info_for_drep
rule drep
rule gtdb_annotate
rule compile_report
rule backup
rule onsuccess
5.3 Scripts
aln_aa_to_dna.pyaln_calc.shcalc_perc_id_orthologs.pycore_cov.pycore_cov.Rdownload_data.pyextract_orthologs.pyfasta_generate_regions.pyfilter_bam.pyfilter_orthologs.pyfilter_sam_aln_length.plfilter_sam_aln_length_unmapped.plfilter_snvs.plfilt_vcf_samples.plget_single_ortho.pyparse_core_cov.pyparse_spades_metagenome.plrearange_faa.pysubset_orthofile.pytrim_aln.py./scripts/write_adapters.py
5.4 Envs
core-cov-env.yaml
mapping-env.yaml
rmd-env.yaml
snv-env.yaml
trim-qc-env.yaml
6 Next steps
It is clear that the database is not best suited for some SDPs found especially in host species other than Apis mellifera. So, the next step would be to implement a MAG based analysis to compare these samples. However, as the database was already shown to be well-suited for Apis mellifera and Apis cerana, another set of analysis would compare these samples from the publication (zenodo) with the samples from India.
6.1 Choosing representative genomes
Information in drep documentation.
\[A * Completeness - B * Contamination + C * (Contamination * frac{strainheterogeneity}{100}) + D * log(N50) + E * log(size) + F * (centrality - S_{a}ni) \]s
“It makes no sense to perform bootstrap with less than 4 sequences” from IQTree
7 Results
7.1 Total number of reads before and after trimming
Total reads by species
7.2 Total number of reads per species after trimming
Total reads by species
7.3 Proportion of reads mapping to host and microbiome database
Proportion of reads mapped to the microbiome database and host database
7.4 Micorbiome reads lost by non_specific mapping to host
Proportion of reads mapped to the microbiome database and host database
7.5 Community profiling with mOTUs
mOTUs Summary
8 MAG analysis
8.1 Mapping to Assembly
The host-filtered reads were assembled. The proportion of reads that mapped back to the assembly and some information about the assemblies such as assembly size, and information about contigs are shown below.
## TableGrob (3 x 1) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[arrange]
## 2 2 (2-2,1-1) arrange gtable[guide-box]
## 3 3 (3-3,1-1) arrange gtable[arrange]